Conventional Approach: Static IP Cores

>  IP  cores  improve  productivity  and  reduce  time-to-market. 

> e.g., Xilinx LogiCORE library:

FFT for N = 16, 64, 256, and 1024 on 16-bit complex numbers


May not match the application’s needs:
parameters, speed, power, area, and
their trade-offs.
Alternative Approach: IP Core Generation

>  Generate  IP  cores  to  match  specific  application  requirements 

(speed,  area,  power,  numerical  accuracy,  and  I/O  bandwidth...) 


[Diagram: application parameters and speed / area / power requirements feed a Generator and an Evaluator, which produce optimized IP cores.]
Design space


>  DSP  transform  design 
can  be  studied  at 
several  levels. 

>  More  math  knowledge 
involved 

=>  Bigger  design  space 
to  explore. 


Electrical & Computer Engineering


Designer’s Focus
Problem


>  Problem:  gap  between  transform  mathematics  and  hardware 
design 


A math guy


What I know:


Linear  algebra 
Digital  signal  processing 
Adaptive  filter  theory ... 


A  hardware  engineer 


What  I  know: 


Finite  state  machine 
Pipelining 
Systolic  array ... 
Bridge: Formula


>  Solution:  -  Formula  representation  of  DSP  transforms 

-  Automated  formula  manipulation  and  mapping 

Formula example: $\mathrm{DFT}_8 = (F_2 \otimes I_4) \cdot D \cdot (I_2 \otimes (I_2 \otimes F_2 \cdots)) \cdot P$


math  guy 


A  hardware  engineer 


Representation 
Formula  Manipulation 
Mapping 


What  I  know: 


Linear  algebra 
Digital  signal  processing 
Adaptive filter theory ...


What  I  know: 


Finite  state  machine 
Pipelining
Systolic  array ... 




Carnegie Mellon


Outline 

>  Introduction 

>  Technical  Details  (illustrated  by  WHT  transform) 

□  What  are  the  degrees  of  design  freedom? 

□  How  do  we  explore  this  design  space? 


>  Experimental  Results 

>  Summary  and  Future  work 
Walsh-Hadamard Transform

> Why WHT?

□  Typical  access  pattern  for  a  DSP  transform 

□  Close  to  2-power  FFT 

□ Study the important construct $\otimes$ (tensor product)

>  Definition 

$$\mathrm{WHT}_{2^n} = \underbrace{F_2 \otimes F_2 \otimes \cdots \otimes F_2}_{n\ \text{fold}}, \qquad F_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$$
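The definition above can be checked numerically. The following is a minimal sketch in plain Python, assuming matrices are stored as lists of rows; the helper names `kron`, `wht_matrix`, and `apply_matrix` are illustrative and do not come from the slides.

```python
# F2 is the 2x2 butterfly matrix from the definition.
F2 = [[1, 1], [1, -1]]


def kron(a, b):
    """Kronecker (tensor) product of two matrices given as lists of rows."""
    return [
        [a_ij * b_kl for a_ij in row_a for b_kl in row_b]
        for row_a in a
        for row_b in b
    ]


def wht_matrix(n):
    """Return WHT_{2^n} as the n-fold Kronecker product of F2 (n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    m = F2
    for _ in range(n - 1):
        m = kron(m, F2)
    return m


def apply_matrix(matrix, x):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(c * v for c, v in zip(row, x)) for row in matrix]


if __name__ == "__main__":
    for row in wht_matrix(2):
        print(row)
```

Every entry of the resulting matrix is +1 or -1, so the transform needs only additions and subtractions.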
• Addition • Subtraction
Pease Algorithm


$\mathrm{WHT}_{2^3} = (F_2 \otimes I_4)(I_2 \otimes F_2 \otimes I_2)(I_4 \otimes F_2)$

$= L^{2^3}_{2}(I_4 \otimes F_2)\, L^{2^3}_{2}(I_4 \otimes F_2)\, L^{2^3}_{2}(I_4 \otimes F_2)$


where $L^{2^3}_{2}$ is a stride permutation
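The Pease form applies the same stage n times: a column of F2 butterflies followed by a stride permutation. Below is a minimal sketch, assuming the stride permutation reads the even-indexed elements first and then the odd-indexed ones. This convention is an assumption; for the WHT the opposite (inverse) convention gives the same result. The function names are illustrative only.

```python
def butterfly_stage(x):
    """Apply I_{N/2} (x) F2: sum and difference of each adjacent pair."""
    y = []
    for i in range(0, len(x), 2):
        y.extend((x[i] + x[i + 1], x[i] - x[i + 1]))
    return y


def stride2(x):
    """Stride-2 permutation (assumed convention): evens first, then odds."""
    return x[0::2] + x[1::2]


def pease_wht(x):
    """WHT of a length-2^n list using n identical Pease stages."""
    n = len(x).bit_length() - 1
    if len(x) < 2 or (1 << n) != len(x):
        raise ValueError("length must be a power of two, at least 2")
    for _ in range(n):
        x = stride2(butterfly_stage(x))
    return x


def direct_wht(x):
    """Reference: entry (r, c) of WHT_{2^n} is (-1)^popcount(r & c)."""
    return [
        sum(-v if bin(r & c).count("1") % 2 else v for c, v in enumerate(x))
        for r in range(len(x))
    ]


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8]
    print(pease_wht(data) == direct_wht(data))
```

Because every stage has the same structure, one physical stage can in principle be reused across iterations, and the butterflies within a stage can share hardware.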
Possibility for vertical folding
Pease Algorithm


$\mathrm{WHT}_{2^3} = (F_2 \otimes I_4)(I_2 \otimes F_2 \otimes I_2)(I_4 \otimes F_2)$

$= L^{2^3}_{2}(I_4 \otimes F_2)\, L^{2^3}_{2}(I_4 \otimes F_2)\, L^{2^3}_{2}(I_4 \otimes F_2)$

Regular routing


an F2 block


Possibility  for  horizontal  folding 


12  F2  blocks 
total 
4 F2 blocks


1 F2  block 
Challenge in Vertical Folding


How  to  fold 
these  wires? 


Q  ports 


> Straightforward approach: Memory-based reordering


□  Extra  control  logic  to  reorder  address 

□  Computation  speed  is  limited  by  memory  speed 


> Ad-hoc approach: Register routing

□  Hard  to  automate  the  process 

>  Our  approach:  formula-based  matrix  factorization 
Factorization of Stride Permutation


[Equation: factorization of the stride permutation $L^{N}_{2}$ (see [1])]

$L^{N}_{2}$ has Q input ports, with $Q = 2^q$ and $N = 2^n$


[Figure: example of $L^{64}_{2}$ ($N = 64$, $Q = 4$)]


[1] J. H. Takala et al., “Multi-Port Interconnection Networks for Radix-R Algorithms”, ICASSP ’01
Freedom in Horizontal Folding

> $\mathrm{WHT}_{2^n}$ has n horizontal stages in the flattened design

□ The divisors of n are all the possible folding degrees

□ Example: HF degrees of $\mathrm{WHT}_{2^6}$ can be 1, 2, 3, 6

> Effects of a higher horizontal folding degree


Latency (cycles): Same
Throughput (op / cycle): Lower (less pipeline depth => lower throughput)
Area: Fewer adders, more muxes & wires
Speed: Not clear
Freedom in Vertical Folding

> $\mathrm{WHT}_{2^n}$ has $2^n$ vertical ports in the flattened design

□ 1, 2, 4, ..., $2^{n-1}$ are all the possible folding degrees

□ Example: VF degrees of $\mathrm{WHT}_{2^6}$ can be 1, 2, 4, ..., 32

> Effects of a higher vertical folding degree


Latency (cycles): Longer
Throughput (op / cycle): Lower
Area: Fewer adders, more registers & muxes
Speed: Not clear

Less I/O bandwidth => longer computation
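The two sets of folding degrees described above follow directly from n. A minimal sketch, with function names chosen here for illustration:

```python
def horizontal_folding_degrees(n):
    """HF degrees of WHT_{2^n}: all divisors of the stage count n."""
    return [d for d in range(1, n + 1) if n % d == 0]


def vertical_folding_degrees(n):
    """VF degrees of WHT_{2^n}: the powers of two 1, 2, 4, ..., 2^(n-1)."""
    return [2 ** k for k in range(n)]


if __name__ == "__main__":
    n = 6  # WHT of size 2^6 = 64
    print(horizontal_folding_degrees(n))
    print(vertical_folding_degrees(n))
```

For n = 6 this reproduces the HF degrees 1, 2, 3, 6 and the VF degrees 1 through 32 given as examples on the slides.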
Outline


>  Technical  Details 

>  Experimental  Results 

>  Summary  and  Future  work 
Design Space Exploration


HF factor {1, 2, 3, 6} × VF factor {1, 2, 4, ..., 32} = 24 different designs

Bit-width (8), Transform size (64)


[Diagram: WHT Generator -> Xilinx FPGA Synthesis (with Technology Library) -> Xilinx FPGA Place & Route, with the results compared against the performance requirement.]
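The exploration loop can be sketched as an exhaustive search over the 24 design points. This is only an illustration under stated assumptions: `DesignPoint`, `enumerate_designs`, and `explore` are hypothetical names, and the `evaluate` callback stands in for the synthesis and place-and-route steps, which this sketch does not perform.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class DesignPoint:
    """One candidate design (hypothetical record, not from the slides)."""
    transform_size: int
    bit_width: int
    hf: int  # horizontal folding degree
    vf: int  # vertical folding degree


def enumerate_designs(transform_size=64, bit_width=8,
                      hf_degrees=(1, 2, 3, 6),
                      vf_degrees=(1, 2, 4, 8, 16, 32)):
    """List every HF x VF combination for the given transform."""
    return [DesignPoint(transform_size, bit_width, hf, vf)
            for hf, vf in product(hf_degrees, vf_degrees)]


def explore(designs, evaluate, meets_requirement, cost):
    """Evaluate each design and keep the cheapest one that is feasible.

    evaluate(design) returns caller-defined metrics; meets_requirement and
    cost interpret them. Returns (design, metrics), or None if no design
    meets the requirement.
    """
    best = None
    for design in designs:
        metrics = evaluate(design)
        if not meets_requirement(metrics):
            continue
        if best is None or cost(metrics) < cost(best[1]):
            best = (design, metrics)
    return best


if __name__ == "__main__":
    print(len(enumerate_designs()))  # number of candidate designs
```

In practice `evaluate` would wrap the tool flow and return measured area, latency, and throughput; `meets_requirement` and `cost` then encode the performance requirement and the quantity to minimize.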


Latency vs. Folding Degrees ($\mathrm{WHT}_{64}$)


[Figure: latency (ns) versus VF degree for each HF degree.]
Latency is almost unaffected by HF, except when comparing the flattened design with a folded design.
Throughput vs. Folding Degrees


[Figure: throughput (MOP/sec) versus folding degrees.]

Folding always lowers throughput.
Comparison with an Existing Design


> $\mathrm{WHT}_{8}$

□ 8-bit fixed-point

□ FPGA: Xilinx Virtex XCV1000E-FG680, speed grade -8

□ Compare our fastest generated design against the results reported by Amira et al. [2]


[Figure: area (# of slices), latency (ns), and throughput (MOP/s) for the design in [2] and our fastest design.]

Our fastest design, relative to the design in [2]:
60% more area
80% reduction in latency
13 times higher throughput


[2] A. Amira et al., “Novel FPGA Implementations of Walsh-Hadamard Transforms for Signal Processing”, IEE Proceedings - Vision, Image and Signal Processing, Volume 148, Issue 6, Dec. 2001
Comparison with an Existing Design


> $\mathrm{WHT}_{8}$

□ 8-bit fixed-point

□ FPGA: Xilinx Virtex XCV1000E-FG680, speed grade -8

□ Compare our smallest generated design against the results reported by Amira et al. [2]


[Figure: area (# of slices), latency (ns), and throughput (MOP/s) for the design in [2] and our smallest design.]

Our smallest design, relative to the design in [2]:
Less area
Shorter latency
Higher throughput


[2] A. Amira et al., “Novel FPGA Implementations of Walsh-Hadamard Transforms for Signal Processing”, IEE Proceedings - Vision, Image and Signal Processing, Volume 148, Issue 6, Dec. 2001
Summary

>  Large  performance  variations  over  the  design  space  of 
horizontal  and  vertical  folding 

>  Automatic  design  space  exploration  through  formula 
manipulation  and  mapping  can  find  the  best  trade-off 

Future work


More DSP transforms:
DFT
DCT
DST
DWT


Representation
Formula Manipulation
Mapping


More design decisions:
Pipelining
Systolic array
Distributed arithmetic
Fixed-point vs. floating-point
Thank you!


Contact:  Fang  Fang 

Email:  ffang@cmu.edu 

URL:  www.ece.cmu.edu/~ffang 